I. Introduction

We state the nonparametric multiclass classification problem as
follows. M classes are characterized by unknown probability distribution
functions. A data sample containing labelled vectors from each of the M
classes is available. A classifier is designed based on the training sample
and evaluated with the test sample.

Friedman [1] has recently introduced a 2-class recursive partitioning
algorithm, motivated in part by the work of Anderson [2], Henderson and Fu
[3], and Meisel and Michalopoulos [4]. Friedman's algorithm generates a
binary decision tree by maximizing the Kolmogorov-Smirnov (K-S) distance
between marginal cumulative distribution functions at each node. In
practice, an estimate of the K-S distance based on a training sample is
maximized. Friedman suggests solving the M-class problem by solving M
2-class problems. The resulting classifier has M binary decision trees.

In  this  note  we  give  a  multiclass  recursive  partitioning  algorithm 
which  generates  a  single  binary  decision  tree  for  classifying  all  classes. 
The algorithm minimizes the Bayes risk at each node. In practice an
estimate  of  the  Bayes  risk  based  on  a  training  sample  is  minimized.  We  also 
give  a  tree  termination  algorithm  which  optimally  terminates  binary  decision 
trees.  The  algorithm  yields  the  unique  tree  with  the  fewest  nodes  which 
minimizes  the  Bayes  risk.  In  practice  an  estimate  of  the  Bayes  risk  based 
on  a  test  sample  is  minimized. 

The research was originally done in 1981-82 [5]. The recent book of
Breiman et al. [6] has elements in common with this paper, but we believe the
approach presented here is different.


The note is organized as follows. In Section 2 we give binary decision
tree notation and the cost structure for our problem. In Sections 3 and 4 we
discuss tree generation and termination, respectively.

II.  Notation 

We shall be interested in classifiers which can be represented by
binary decision trees. For our purposes, a binary decision tree $T$ is a
collection of nodes $\{N_i\}_{i=1}^{K}$ with the structure shown in Fig. 2.1. The
levels of $T$ are ordered monotonically as $0, 1, \ldots, L-1$ going from bottom to
top. The nodes of $T$ are ordered monotonically as $1, 2, \ldots, K$ going from
bottom to top, and for each level from left to right. We shall find it
convenient to denote the subtree of $T$ with root node $N_i$ and whose terminal
nodes are also terminal nodes of $T$ as $T(i)$ (see Fig. 2.1).

We associate a binary decision tree and a classifier in the following
way. For each node $N_i \in T$ we have at most five decision parameters: $k_i$,
$\alpha_i$, $l_i$, $r_i$, and $c_i$. Suppose $\alpha \in R^d$ is to be classified. The root node $N_K$ is
where the decision process begins. At $N_i$ the $k_i$-th component of $\alpha$ will be
used for discrimination. If $\alpha^{k_i} < \alpha_i$ the next decision will be made at
$N_{l_i}$. If $\alpha^{k_i} > \alpha_i$ the next decision will be made at $N_{r_i}$. If $N_i$ is a
terminal node then $\alpha$ is labelled class $c_i$. (Here $\alpha^k$ is the $k$-th coordinate
of $\alpha$.) It is easily seen that a binary
decision tree with these decision parameters can represent a classifier
which partitions $R^d$ into d-dimensional intervals. The algorithms we shall
discuss generate binary decision trees as partitioning proceeds.
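The decision process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Node` class and `classify` function are hypothetical names, child nodes are stored as object references rather than as the node indices used in the text, and a coordinate exactly equal to the threshold is sent to the right child, a case the text leaves unspecified.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Node:
    k: Optional[int] = None            # coordinate used for discrimination
    threshold: Optional[float] = None  # split point on coordinate k
    left: Optional["Node"] = None      # next node when x[k] < threshold
    right: Optional["Node"] = None     # next node otherwise (ties go right: an assumption)
    label: Optional[int] = None        # class label, used only at terminal nodes


def classify(root: Node, x: Sequence[float]) -> Optional[int]:
    """Walk from the root to a terminal node and return its class label."""
    node = root
    while node.left is not None and node.right is not None:
        node = node.left if x[node.k] < node.threshold else node.right
    return node.label


# A small tree that partitions the plane into three intervals.
root = Node(k=0, threshold=0.5,
            left=Node(label=1),
            right=Node(k=1, threshold=2.0,
                       left=Node(label=2), right=Node(label=3)))
```

Each terminal node corresponds to one d-dimensional interval, so the tree above labels the half-plane with first coordinate below 0.5 as class 1 and splits the remainder on the second coordinate.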
Let $H_j$ be the hypothesis that the vector under consideration belongs to
the $j$-th class, $j = 1, \ldots, M$. We denote by $\ell_j$ the misclassification cost for
$H_j$ and by $\pi_j$ the prior probability of $H_j$. The Bayes risk (of misclassification)
is then given by

$$\sum_{j=1}^{M} \ell_j \pi_j \left(1 - \Pr\{\text{decide } H_j \mid H_j\}\right).$$

III.  Tree  Generation 

In  this  section  generation  of  binary  decision  trees  is  discussed.  An 
algorithm  is  given  which  generates  a  single  binary  decision  tree  for 
classifying  all  classes.  The  algorithm  minimizes  the  Bayes  risk  at  each 
node.  In  practice  an  estimate  of  the  Bayes  risk  based  on  a  training  sample 
is minimized.

We first review Friedman's 2-class algorithm. Friedman's algorithm is
based on a result of Stoller's [5] concerning univariate nonparametric
classification ($d = 1$). We assume $\ell_1 \pi_1 = \ell_2 \pi_2$.

Stoller solves the following problem: find $\alpha^*$ which minimizes the
Bayes risk based on the classifier

    $\alpha < \alpha^*$:  decide $H_1$ or $H_2$

    $\alpha > \alpha^*$:  decide $H_2$ or $H_1$

Let $F_1(\alpha)$, $F_2(\alpha)$ be the cumulative distribution functions (c.d.f.'s) for $H_1$,
$H_2$ respectively, and let

$$D(\alpha) = |F_1(\alpha) - F_2(\alpha)| \tag{3.1}$$

Stoller shows that

$$\alpha^* = \arg\max_{\alpha} D(\alpha) \tag{3.2}$$

($D(\alpha^*)$ is the Kolmogorov-Smirnov distance between $F_1$ and $F_2$). This
procedure can be applied recursively until all intervals in the classifier
meet a termination criterion. A terminal interval $I$ is then assigned the
class label

$$c^* = \arg\max_{j=1,2} \Pr\{\alpha \in I \mid H_j\} \tag{3.3}$$


Friedman extends Stoller's algorithm to the multivariate case ($d \ge 2$) by
solving the following problem: find $k^*$ and $\alpha^*$ which minimize the Bayes risk
of the classifier

    $\alpha^{k^*} < \alpha^*$:  decide $H_1$ or $H_2$

    $\alpha^{k^*} > \alpha^*$:  decide $H_2$ or $H_1$

Let $F_{1,k}(\alpha)$, $F_{2,k}(\alpha)$ be the marginal c.d.f.'s on coordinate $k$ for $H_1$, $H_2$
respectively, and let

$$D_k(\alpha) = |F_{1,k}(\alpha) - F_{2,k}(\alpha)| \tag{3.4}$$

In view of (3.2) we have

$$\begin{aligned}
\alpha^*(k) &= \arg\max_{\alpha} D_k(\alpha) \\
k^* &= \arg\max_{k} D_k(\alpha^*(k)) \\
\alpha^* &= \alpha^*(k^*)
\end{aligned} \tag{3.5}$$


As with the univariate case, Friedman's procedure can be applied recursively
until all (d-dimensional) intervals in the classifier meet a termination
criterion. A terminal interval $I$ is then assigned class label

$$c^* = \arg\max_{j=1,2} \Pr\{\alpha \in I \mid H_j\} \tag{3.6}$$


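To apply Friedman's algorithm to the nonparametric classification
problem we must estimate $F_{j,k}(\alpha)$ and $\Pr\{\alpha \in I \mid H_j\}$. Let $\alpha_{1,1}, \ldots, \alpha_{1,n_1}$ and
$\alpha_{2,1}, \ldots, \alpha_{2,n_2}$ be the training sample vectors, where $\alpha_{j,i}$ is the $i$-th vector
from the $j$-th class. Suppose we have arranged the sample such that
$\alpha_{j,1}^{k} \le \alpha_{j,2}^{k} \le \cdots \le \alpha_{j,n_j}^{k}$. We then estimate $F_{j,k}(\alpha)$ by

$$\hat{F}_{j,k}(\alpha) = \begin{cases}
0, & \alpha < \alpha_{j,1}^{k} \\
i/n_j, & \alpha_{j,i}^{k} \le \alpha < \alpha_{j,i+1}^{k} \\
1, & \alpha \ge \alpha_{j,n_j}^{k}
\end{cases}$$

and $\Pr\{\alpha \in I \mid H_j\}$ by the fraction of training sample vectors in class $j$ which
land in $I$.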
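As an illustration of how the 2-class split can be estimated from a training sample, the sketch below evaluates the empirical marginal c.d.f.'s on every coordinate and returns the coordinate and threshold with the largest estimated K-S distance. The function names `ecdf` and `ks_split` are hypothetical, and two choices are assumptions rather than part of the source: candidate thresholds are restricted to observed sample values, and the empirical c.d.f. counts values less than or equal to the threshold.

```python
from bisect import bisect_right


def ecdf(sorted_sample, a):
    """Empirical c.d.f.: fraction of sample values <= a (sample must be sorted)."""
    return bisect_right(sorted_sample, a) / len(sorted_sample)


def ks_split(class1, class2):
    """Return (k, a, D) maximizing the estimated K-S distance over coordinates k
    and thresholds a, for two lists of equal-length training vectors."""
    d = len(class1[0])
    best_k, best_a, best_dist = None, None, -1.0
    for k in range(d):
        s1 = sorted(v[k] for v in class1)
        s2 = sorted(v[k] for v in class2)
        # The empirical c.d.f.'s only change at sample values, so it is
        # enough to evaluate the distance there.
        for a in sorted(set(s1 + s2)):
            dist = abs(ecdf(s1, a) - ecdf(s2, a))
            if dist > best_dist:
                best_k, best_a, best_dist = k, a, dist
    return best_k, best_a, best_dist


# Two classes that are separated on coordinate 0 and interleaved on coordinate 1.
class1 = [[0, 5.0], [1, 6.0], [2, 7.0]]
class2 = [[10, 5.5], [11, 6.5], [12, 7.5]]
k_star, a_star, dist = ks_split(class1, class2)
```

In this toy sample the split is found on coordinate 0, where the two empirical c.d.f.'s differ by 1 at the largest class-1 value; coordinate 1 never yields a distance above 1/3.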

Note that Friedman's algorithm generates a binary decision tree as
partitioning proceeds by appropriately identifying the decision parameters
of Section 2.


Friedman extends his algorithm to the M-class case by generating M
binary decision trees, where the $j$-th tree discriminates between the
$j$-th class and all the other classes taken as a group. We next propose an
extension which has the advantage of generating a single binary decision
tree for classifying all classes. At the same time we relax the constraint
that all the $\ell_j \pi_j$'s are equal.

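Consider the following problem: find the $k^*$, $\alpha^*$, $m^*$ and $n^*$ which
minimize the Bayes risk based on the classifier

    $\alpha^{k^*} < \alpha^*$:  decide $H_{m^*}$ or $H_{n^*}$

    $\alpha^{k^*} > \alpha^*$:  decide $H_{n^*}$ or $H_{m^*}$

Let

$$R_{m,n,k}(\alpha) = \min\bigl\{\ell_m \pi_m (1 - F_{m,k}(\alpha)) + \ell_n \pi_n F_{n,k}(\alpha),\;
\ell_n \pi_n (1 - F_{n,k}(\alpha)) + \ell_m \pi_m F_{m,k}(\alpha)\bigr\}
+ \sum_{j \ne m,n} \ell_j \pi_j \tag{3.7}$$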


Then it can easily be shown that

$$\begin{aligned}
\alpha^*(m,n,k) &= \arg\min_{\alpha} R_{m,n,k}(\alpha) \\
k^*(m,n) &= \arg\min_{k} R_{m,n,k}(\alpha^*(m,n,k)) \\
(m^*, n^*) &= \arg\min_{m<n} R_{m,n,k^*(m,n)}(\alpha^*(m,n,k^*(m,n))) \\
k^* &= k^*(m^*, n^*) \\
\alpha^* &= \alpha^*(m^*, n^*, k^*)
\end{aligned} \tag{3.8}$$

Furthermore, if $\ell_1 \pi_1 = \cdots = \ell_M \pi_M$ the minimizations over $R_{m,n,k}(\alpha)$ reduce
to maximizations over

$$D_{m,n,k}(\alpha) = |F_{m,k}(\alpha) - F_{n,k}(\alpha)| \tag{3.9}$$
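The reduction to (3.9) can be checked in one line. The derivation below is a sketch, writing $\ell_j$ for the misclassification cost and $\pi_j$ for the prior of class $j$, assuming the common value of the products $\ell_j \pi_j$ is a constant $c > 0$, and taking $D_{m,n,k}(\alpha) = |F_{m,k}(\alpha) - F_{n,k}(\alpha)|$ by analogy with (3.4). It uses the identity $\min\{1 - x, 1 + x\} = 1 - |x|$ with $x = F_{m,k}(\alpha) - F_{n,k}(\alpha)$.

```latex
\begin{align*}
R_{m,n,k}(\alpha)
  &= c \min\bigl\{1 - F_{m,k}(\alpha) + F_{n,k}(\alpha),\;
                  1 - F_{n,k}(\alpha) + F_{m,k}(\alpha)\bigr\} + (M - 2)\,c \\
  &= c \bigl(1 - |F_{m,k}(\alpha) - F_{n,k}(\alpha)|\bigr) + (M - 2)\,c \\
  &= c \bigl(M - 1 - D_{m,n,k}(\alpha)\bigr).
\end{align*}
```

Since $c$ and $M$ do not depend on $\alpha$, $k$, $m$ or $n$, minimizing $R_{m,n,k}(\alpha)$ is the same as maximizing $D_{m,n,k}(\alpha)$ under this assumption.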


If we now replace the double maximization (3.5) in Friedman's algorithm with
the triple minimization (3.8) we get the proposed multiclass recursive
partitioning algorithm. Of course (3.6) should be replaced by

$$c^* = \arg\max_{j=1,\ldots,M} \ell_j \pi_j \Pr\{\alpha \in I \mid H_j\} \tag{3.10}$$


Otherwise  the  algorithms  are  the  same.  In  particular  the  multiclass 
algorithm  generates  a  single  binary  decision  tree  as  partitioning  proceeds 
by  appropriately  identifying  the  decision  parameters  of  Section  2.  Note 
that  m*  and  n*  are  not  decision  parameters. 
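A single node split of the multiclass algorithm can be sketched as an exhaustive search over class pairs, coordinates and thresholds, with the c.d.f.'s replaced by empirical estimates. The names `multiclass_split`, `samples` and `weights` are hypothetical, and `weights[j]` stands for the product of the misclassification cost and prior of class j. As assumptions of this sketch, thresholds are restricted to observed sample values and a vector exactly at the threshold is assigned to the lower side.

```python
from bisect import bisect_right
from itertools import combinations


def ecdf(sorted_sample, a):
    """Empirical c.d.f.: fraction of sample values <= a (sample must be sorted)."""
    return bisect_right(sorted_sample, a) / len(sorted_sample)


def multiclass_split(samples, weights):
    """Search class pairs (m, n), coordinates k and thresholds a for the
    smallest estimated risk. Returns (risk, k, a, low, high), where class
    `low` is decided when x[k] <= a and class `high` otherwise."""
    classes = sorted(samples)
    d = len(samples[classes[0]][0])
    total = sum(weights[j] for j in classes)
    best = None
    for m, n in combinations(classes, 2):
        rest = total - weights[m] - weights[n]  # classes never decided at this split
        for k in range(d):
            sm = sorted(v[k] for v in samples[m])
            sn = sorted(v[k] for v in samples[n])
            for a in sorted(set(sm + sn)):
                fm, fn = ecdf(sm, a), ecdf(sn, a)
                r_mn = weights[m] * (1 - fm) + weights[n] * fn  # m below, n above
                r_nm = weights[n] * (1 - fn) + weights[m] * fm  # n below, m above
                for r, low, high in ((r_mn, m, n), (r_nm, n, m)):
                    if best is None or r + rest < best[0]:
                        best = (r + rest, k, a, low, high)
    return best


# Three one-dimensional classes with equal weights; class 1 is well separated.
samples = {1: [[0.0], [1.0]], 2: [[5.0], [6.0]], 3: [[5.5], [6.5]]}
weights = {1: 1.0, 2: 1.0, 3: 1.0}
best = multiclass_split(samples, weights)
```

In this example the best split separates class 1 from class 2 at 1.0 with estimated risk 1.0, which is entirely the weight of class 3, the class that cannot be decided at this node. Ties between equally good splits are broken in favor of the first one found.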


IV.  Tree  Termination 
In this section termination of binary decision trees is discussed. An
algorithm is given for optimally terminating a binary decision tree. The
algorithm yields the unique tree with fewest nodes which minimizes the Bayes
risk. In practice an estimate of the Bayes risk based on a test sample is
minimized.

Suppose  we  generate  a  binary  decision  tree  with  the  multiclass 
recursive  partitioning  algorithm  of  Section  3.  Partitioning  can  proceed 
until  terminal  nodes  only  contain  training  sample  vectors  from  a  single 
class. In this case the entire training sample is correctly classified.
But if class distributions overlap the optimal Bayes rule should not
correctly  classify  the  entire  training  sample.  Thus  we  are  led  to  examine 
termination  of  binary  decision  trees. 

Friedman introduces a termination parameter k = minimum number of
training sample vectors in a terminal node. The value of k is determined by
minimizing the Bayes risk. In practice an estimate of the Bayes risk based
on a test sample is minimized. In the sequel we will refer to the binary
decision  tree  with  terminal  nodes  only  containing  training  sample  vectors 
from  a  single  class  as  the  "full"  tree.  What  Friedman's  method  amounts  to 
is  minimizing  the  Bayes  risk  over  a  subset  of  the  subtrees  of  the  full  tree 
with  the  same  root  node.  At  this  point  the  following  question  arises:  is 
there  a  computationally  efficient  method  of  minimizing  the  Bayes  risk  over 
all  subtrees  of  the  full  tree  with  the  same  root  node?  The  answer  is  yes  as 
we  shall  now  show. 

We first state a certain combinatorial problem. Suppose we have a
binary decision tree and with each node of the tree we associate a cost. We
define the cost of each subtree as the sum of the costs of its terminal
nodes. The problem is to find the subtree with the same root node as the
original tree which maximizes cost. More precisely, let $T_0 = \{N_j\}_{j=1}^{K}$ be a
binary decision tree with $L$ levels and $K_i$ nodes at level $i$ as described in
Section 2, $g_i$ the cost associated with node $N_i$, and $G(T)$ the cost of subtree
$T$. Then

$$G(T) = \sum_{i=1}^{K} 1_i(T)\, g_i \tag{4.1}$$

where

$$1_i(T) = \begin{cases}
1, & N_i \text{ is a terminal node of } T \\
0, & \text{else}
\end{cases}$$

Now let $S$ be the set of subtrees of $T_0$ with the same root node $N_K$. The
problem can then be stated as:

$$\max_{T \in S} G(T) = \max_{T \in S} \sum_{i=1}^{K} 1_i(T)\, g_i \tag{4.2}$$

Next consider the following simple algorithm. Going from first to last
level and for each level from left to right, if deleting the descendants of
the current node does not decrease cost, delete the descendants and go to the
next node, etc. In view of (4.1) the algorithm becomes:
    For i = 1, ..., L-1 do:
        T_i <- T_{i-1}
        For each node N_j on level i, from left to right, do:
            If g_j >= G(T_i(j)):
                T_i(j) <- {N_j}

Here $T_i$ denotes the tree obtained once level $i$ has been processed.
Define $T^* = T_{L-1}$. We claim that $T^*$ solves (4.2).

Theorem: $G(T^*) \ge G(T)$ for all $T \in S$.

Furthermore, if $G(T^*) = G(T)$ for some $T \in S$, $T \ne T^*$, then $T^*$ has fewer nodes
than $T$.

Proof:  See  Appendix. 
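The termination algorithm can be sketched recursively. The version below is an illustration, not the level-by-level loop of the text: it visits children before their parent, which should give the same result because the decision at a node depends only on the already-pruned subtree below it. The names `Node`, `prune` and `count_nodes` are hypothetical, and nodes hold references to their children instead of the index ordering used in Section 2.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    g: float                       # cost of this node if it becomes terminal
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def prune(node: Node) -> float:
    """Prune the subtree rooted at `node` in place and return its cost G,
    the sum of g over the terminal nodes that remain."""
    if node.left is None or node.right is None:
        return node.g                           # already terminal
    subtree_cost = prune(node.left) + prune(node.right)
    if node.g >= subtree_cost:                  # deleting descendants does not decrease cost
        node.left = node.right = None
        return node.g
    return subtree_cost


def count_nodes(node: Optional[Node]) -> int:
    """Number of nodes in the subtree rooted at `node`."""
    return 0 if node is None else 1 + count_nodes(node.left) + count_nodes(node.right)


# Full tree with seven nodes; only the left branch is worth terminating early.
root = Node(1.0,
            left=Node(3.0, Node(1.0), Node(1.0)),
            right=Node(2.0, Node(2.0), Node(1.5)))
total_cost = prune(root)
```

Using the non-strict comparison means that, among subtrees of equal cost, the one with fewer nodes is kept, in line with the second part of the theorem. In the example the left branch is cut back to a single terminal node (cost 3.0 against 2.0 for its children), while the right branch is kept because its children are worth 3.5 against 2.0.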

Finally, we show that the problem of minimizing the Bayes risk over all
subtrees of the full tree with the same root node has form (4.2). Let $T_0$ be
the full tree and

$$g_i = \ell_{c_i} \pi_{c_i} \Pr\{\alpha \in N_i \mid H_{c_i}\}, \qquad i = 1, \ldots, K \tag{4.3}$$

where $c_i$ is the class label of $N_i$ if $N_i$ becomes a terminal node, i.e.,

$$c_i = \arg\max_{j=1,\ldots,M} \ell_j \pi_j\, p_{ij} \tag{4.4}$$

where $p_{ij}$ is the fraction of training sample vectors in class $j$ which land
in $N_i$. Then by direct computation the Bayes risk of $T \in S$ is given by

$$R(T) = \sum_{j=1}^{M} \ell_j \pi_j - \sum_{i=1}^{K} 1_i(T)\, g_i = \sum_{j=1}^{M} \ell_j \pi_j - G(T) \tag{4.5}$$

Hence, minimizing $R(T)$ is equivalent to maximizing $G(T)$. In practice an
estimate of $R(T)$ based on a test sample is minimized. In this case

$$g_i = \ell_{c_i} \pi_{c_i}\, q_{i c_i}, \qquad i = 1, \ldots, K \tag{4.6}$$

where $q_{ij}$ is the fraction of test sample vectors in class $j$ which land in
$N_i$.
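The "direct computation" behind (4.5) can be spelled out as follows. This is a reconstruction, writing $\ell_j$ for the misclassification cost and $\pi_j$ for the prior probability of $H_j$, and assuming that every terminal node $N_i$ of $T$ decides class $c_i$, so that the probability of correctly deciding $H_j$ is the total probability, under $H_j$, of the terminal nodes labelled $j$.

```latex
\begin{align*}
R(T) &= \sum_{j=1}^{M} \ell_j \pi_j \bigl(1 - \Pr\{\text{decide } H_j \mid H_j\}\bigr) \\
     &= \sum_{j=1}^{M} \ell_j \pi_j
        - \sum_{j=1}^{M} \ell_j \pi_j \sum_{i:\, 1_i(T) = 1,\; c_i = j} \Pr\{\alpha \in N_i \mid H_j\} \\
     &= \sum_{j=1}^{M} \ell_j \pi_j
        - \sum_{i=1}^{K} 1_i(T)\, \ell_{c_i} \pi_{c_i} \Pr\{\alpha \in N_i \mid H_{c_i}\} \\
     &= \sum_{j=1}^{M} \ell_j \pi_j - G(T).
\end{align*}
```

The first sum does not depend on $T$, which is why minimizing the risk over subtrees amounts to maximizing $G(T)$.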

APPENDIX 

Proof of Theorem, Section IV: Let $S_i$ be the set of subtrees of $T_0$ with
the same root node $N_K$ and which only have nodes missing from levels $0, \ldots, i-1$
(or equivalently, every terminal node on levels $i, \ldots, L-1$ is also a
terminal node of $T_0$). We shall say that $T_i$ is optimal over $S_i$ if the
theorem holds with $T^*$ and $S$ replaced by $T_i$ and $S_i$ respectively. We show
that $T_i$ is optimal over $S_i$ for $i = 1, \ldots, L-1$. Since $T^* = T_{L-1}$ and $S = S_{L-1}$
the theorem follows. We proceed by induction. $T_1$ is clearly optimal over
$S_1$. We assume $T_i$ is optimal over $S_i$ and want to show that $T_{i+1}$ is optimal
over $S_{i+1}$. Let $T \in S_{i+1}$ and $T \ne T_{i+1}$. There are four cases to consider.

Suppose there exists a terminal node $N_j \in T_{i+1}$ which is a nonterminal
node of $T$ and $N_j$ is on some level $\le i$. Construct $T' \in S_{i+1}$ from $T$ by
terminating $T$ at $N_j$. Since $N_j$ is a terminal node of $T_{i+1}$ it is also a
terminal node of $T_i$ and it follows from (4.1) and the optimality of $T_i$ that
$g_j \ge G(T(j))$ so that $G(T') \ge G(T)$, and since $T'$ has fewer nodes than $T$, $T$
cannot be optimal over $S_{i+1}$.

Next, suppose there exists a terminal node $N_j \in T$ which is a nonterminal
node of $T_{i+1}$ and $N_j$ is on some level $\le i$. Construct $T' \in S_{i+1}$ from $T$ by
augmenting $T$ with $T_{i+1}(j)$ at $N_j$. Since $T_{i+1}(j) = T_i(j)$ it follows from
(4.1) and the optimality of $T_i$ that $G(T'(j)) > g_j$ so that $G(T') > G(T)$, and
consequently $T$ cannot be optimal over $S_{i+1}$.

Next, suppose there exists a terminal node $N_j \in T_{i+1}$ which is a
nonterminal node of $T$ and $N_j$ is on level $i+1$. If $T(j) = T_i(j)$ construct
$T' \in S_{i+1}$ from $T$ by terminating $T$ at $N_j$. Since $g_j \ge G(T_i(j)) = G(T(j))$ it
follows from (4.1) that $G(T') \ge G(T)$, and since $T'$ has fewer nodes than $T$, $T$
cannot be optimal over $S_{i+1}$. If $T(j) \ne T_i(j)$ construct $T' \in S_{i+1}$ from $T$ by
replacing $T(j)$ with $T_i(j)$. At this point we essentially are in one of the
preceding cases (with $T$ replaced by $T'$).

Finally, suppose there exists a terminal node $N_j \in T$ which is a
nonterminal node of $T_{i+1}$ and $N_j$ is on level $i+1$. Construct $T' \in S_{i+1}$ from $T$
by augmenting $T$ with $T_{i+1}(j)$ at $N_j$. Since $T_{i+1}(j) = T_i(j)$ we have $g_j <
G(T_i(j)) = G(T_{i+1}(j)) = G(T'(j))$ and it follows from (4.1) that $G(T) <
G(T')$, and consequently $T$ cannot be optimal over $S_{i+1}$. QED